Explore & Summarize Data by Yasser Arafath

This data set consists of fatal accidents that occurred in the US from 1998 to 2010. The variables include age, state, airbag availability, whether the airbags were deployed, the driver's injury level, etc. There are many missing values; these are handled by creating subsets of the dataset.

Summary of the dataset:

##       year              caseid           state            age        
##  Min.   :1998   1.031956019:    36   Min.   : 1.00   Min.   :  0.00  
##  1st Qu.:2000   1.038900463:    36   1st Qu.:12.00   1st Qu.: 19.00  
##  Median :2003   1.013900463:    35   Median :27.00   Median : 26.00  
##  Mean   :2004   1.011122685:    34   Mean   :27.23   Mean   : 34.85  
##  3rd Qu.:2006   0.961122685:    33   3rd Qu.:42.00   3rd Qu.: 48.00  
##  Max.   :2010   0.993761574:    33   Max.   :56.00   Max.   :100.00  
##                 (Other)    :150951                   NA's   :2762    
##       sex          D_injury    D_airbagAvail D_airbagDeploy
##  Min.   :1.00   Min.   :0.00   no  :45087    no  :78480    
##  1st Qu.:1.00   1st Qu.:1.00   yes :98839    yes :51582    
##  Median :2.00   Median :3.00   NA's: 7232    NA's:21096    
##  Mean   :1.51   Mean   :2.47                               
##  3rd Qu.:2.00   3rd Qu.:4.00                               
##  Max.   :2.00   Max.   :5.00                               
##  NA's   :879
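The counts in the summary above are enough to gauge how much of each column is missing. A minimal Python sketch using only the figures printed in the summary; the total row count is not printed directly, so it is derived here under the assumption that every row falls into exactly one `D_airbagAvail` category:

```python
# Counts copied from the summary output above.
airbag_avail = {"no": 45087, "yes": 98839, "NA": 7232}
missing = {"age": 2762, "sex": 879, "D_airbagAvail": 7232, "D_airbagDeploy": 21096}

# Assumption: every row falls into exactly one D_airbagAvail category,
# so the three counts add up to the number of rows in the dataset.
total_rows = sum(airbag_avail.values())

# Percentage of rows with a missing value, per column.
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}

print("rows:", total_rows)
for col, pct in missing_pct.items():
    print(f"{col}: {pct}% missing")
```

The same total is obtained by adding the three `D_airbagDeploy` counts, which supports the assumption.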

A histogram is created to see the number of accidents by age group

This histogram shows that people aged 15-25 are involved in the largest number of accidents

A histogram is created to see the number of accidents in each year from 1998 to 2010

This histogram shows that the number of accidents per year is decreasing.

A histogram to see which sex was involved in the most accidents

It is seen that females are marginally higher in count than males

A histogram to see if the cars were equipped with airbags

It is seen that the majority of the cars were equipped with airbags and were still involved in a fatal accident

A histogram to see the deployment of airbag during the accident

The airbags weren't deployed in the majority of the accidents, which suggests that airbags play an important role in safety

A histogram to see the deployment of airbags in the cars that had airbags available

This graph shows the ratio of deployed to non-deployed airbags among the cars that had airbags available
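The ratio described above can also be computed directly. A minimal Python sketch, assuming records with the `D_airbagAvail` and `D_airbagDeploy` columns from the summary; the sample rows and the `deployment_shares` helper are hypothetical and only illustrate the calculation:

```python
from collections import Counter

# Hypothetical sample rows, not taken from the real dataset.
records = [
    {"D_airbagAvail": "yes", "D_airbagDeploy": "yes"},
    {"D_airbagAvail": "yes", "D_airbagDeploy": "no"},
    {"D_airbagAvail": "yes", "D_airbagDeploy": "no"},
    {"D_airbagAvail": "no", "D_airbagDeploy": "no"},
    {"D_airbagAvail": "yes", "D_airbagDeploy": None},  # missing value
]


def deployment_shares(rows):
    """Share of each known D_airbagDeploy value among airbag-equipped cars."""
    equipped = [
        r for r in rows
        if r["D_airbagAvail"] == "yes" and r["D_airbagDeploy"] is not None
    ]
    counts = Counter(r["D_airbagDeploy"] for r in equipped)
    total = sum(counts.values())
    if total == 0:
        return {}  # no airbag-equipped cars with a known deployment status
    return {value: n / total for value, n in counts.items()}


shares = deployment_shares(records)
print(shares)
```

Rows with a missing deployment status are dropped before the shares are computed, so the shares always add up to 1.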

A histogram to see the number of accidents in every state

The highest number of accidents occurred in state “6”

Univariate Analysis

What is the structure of your dataset?

This dataset is a list of all the fatal accidents which occurred in the US from 1998 to 2010.

What is/are the main feature(s) of interest in your dataset?

The main feature of this dataset is the number of accidents that happened in each year from 1998 to 2010, broken down by sex.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The cause of the accident and the time of occurrence would help the investigation further.

Did you create any new variables from existing variables in the dataset?

Yes, a subset of the dataset without the ‘NA’ values for ‘sex’ and ‘age’ was created, as well as a subset containing only the cars with airbags.
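The two subsets can be sketched as simple filters. This is a minimal Python illustration with hypothetical sample rows, not the code used for the analysis; missing values are represented as `None`:

```python
# Hypothetical sample rows using column names from the summary.
rows = [
    {"age": 23, "sex": 1, "D_airbagAvail": "yes"},
    {"age": None, "sex": 2, "D_airbagAvail": "yes"},
    {"age": 67, "sex": None, "D_airbagAvail": "no"},
    {"age": 41, "sex": 2, "D_airbagAvail": "no"},
]

# Subset 1: keep only rows where both age and sex are known.
complete = [r for r in rows if r["age"] is not None and r["sex"] is not None]

# Subset 2: keep only cars that had an airbag available.
with_airbag = [r for r in rows if r["D_airbagAvail"] == "yes"]

print(len(complete), "rows with known age and sex")
print(len(with_airbag), "rows with an airbag available")
```

Keeping the subsets separate means a row with a missing age is still counted in the airbag plots, where age is not needed.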

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

There weren't any unusual distributions. Apart from the missing values, the dataset was already clean.

Bivariate Plots Section

These graphs are obtained by comparing two variables

We first group the values by age and sex

Accident data are first separated by sex

A bar graph of accidents by age is created only for men

A bar graph of accidents by age is created only for women

The graphs are put side by side for a better comparison

A histogram is created comparing the age with the mean of the injury values

Age vs Injury

A scatterplot is created comparing the sex with the mean of the injury values

Age vs Injury by Sex
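The quantity plotted here is the mean injury value within each age and sex group. A minimal Python sketch of that grouping, using hypothetical sample values (not real results) and the numeric `sex` codes 1 and 2 that appear in the summary:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (age, sex, D_injury) triples; not taken from the real dataset.
rows = [
    (25, 1, 4), (25, 1, 3), (25, 2, 2),
    (80, 2, 4), (80, 2, 3), (80, 1, 3),
]

# Collect the injury values of every (age, sex) group.
groups = defaultdict(list)
for age, sex, injury in rows:
    groups[(age, sex)].append(injury)

# Mean injury per group, ordered by age and then by sex code.
mean_injury = {key: mean(values) for key, values in sorted(groups.items())}

for (age, sex), value in mean_injury.items():
    print(f"age={age} sex={sex} mean injury={value}")
```

Grouping by `age` alone instead of `(age, sex)` gives the values behind the plain Age vs Injury plot.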

A boxplot is created comparing the age with the injury values

A boxplot to compare Age vs Injury by Sex

From the previous histogram, it was seen that the states ‘6’, ‘48’ & ‘12’ had the highest number of accidents.

Here we create separate bar graphs, one per state, to show the number of accidents in the top 3 states by age group

A grid is created to show the top states which had the highest number of accidents
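Picking the top states amounts to counting the records per state code and keeping the three largest counts. A minimal Python sketch with a hypothetical list of state codes (the counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical state codes, one per accident record.
states = [6, 48, 6, 12, 48, 6, 1, 12, 48, 6]

# Count the records per state and keep the three most frequent codes.
state_counts = Counter(states)
top3 = state_counts.most_common(3)

# Keep only the records from the top 3 states for the per-state plots.
top_codes = {code for code, _ in top3}
top_state_records = [s for s in states if s in top_codes]

print(top3)
```

`most_common` returns the codes ordered from the highest count down, which is also a convenient order for laying out the grid.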

A boxplot is created comparing the age with the Airbag Availability

A boxplot to compare Age vs Airbag Availability

Bivariate Analysis

From the first histogram, it is observed that the mean injury level is uniform across the various age groups, except for ages above 90. From the scatterplot, men in various age groups suffered higher injury than women of the same age group.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

In this part, the injury level of the victim was compared to the age. It was found that it did not vary much across the age groups.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Injury level was compared between the sexes; males were found to be prone to higher injury than females in the 20-50 age group.

What was the strongest relationship you found?

People in the 75-85 age group suffered the highest injury levels.

Multivariate Plots Section

A graph is generated to compare Age vs Sex vs Year

From this graph, we can see that the blue color is in the top portion of the graph and brown is spread across the bottom, i.e. females older than 70 are involved in more accidents than men throughout the years.

A boxplot to represent the accidents for each sex in every year

This is another detailed representation of the accidents for each sex in every year

A histogram to see which age group had the highest number of accidents in the top 3 states

Number of accidents by Age group in Top 3 states

A boxplot to see the level of injury by age in ascending order

A boxplot to indicate the injury level in order

A level plot to compare injury level and age by year

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

From this multivariate analysis, we observed that the majority of victims in the 70-90 age group were female, while in the 20-50 age group the majority were male.

Were there any interesting or surprising interactions between features?

There was nothing surprising or interesting.


Final Plots and Summary

Plot One

This is one of the important histograms, as it shows which age group faced the most fatalities. From this histogram, we can infer that people aged 15-25 were involved in the largest number of accidents.

Plot Two

This graph shows how much injury each sex suffered in the crash. Males aged 20-50 suffered more severe injuries than females, and females aged 70-90 had a higher injury rate than males in that group.

Plot Three

From this multivariate analysis, we observed that the majority of victims in the 70-90 age group were female, while in the 20-50 age group the majority were male.

Reflection

So from this dataset, many observations were made:

1. People aged 15-25 were involved in the largest number of accidents

2. The number of accidents per year decreased over time

3. Number of incidents where the airbag was deployed: 51582; airbag not deployed: 78480

4. Females were slightly more numerous than males

5. Injury suffered by males was higher than by females in the age group 20-50

6. Injury suffered by females was higher than by males in the age group 70-90
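Observation 3 is easier to read as shares than as raw counts. A minimal Python sketch, assuming the `D_airbagDeploy` counts printed in the dataset summary (yes: 51582, no: 78480) and leaving out the 21096 records where the deployment status is missing:

```python
# Counts copied from the D_airbagDeploy column of the dataset summary.
deployed = 51582
not_deployed = 78480

# Share of records with a known status in which the airbag deployed.
known = deployed + not_deployed
share_deployed = deployed / known
share_not_deployed = not_deployed / known

print(f"deployed: {share_deployed:.1%}")
print(f"not deployed: {share_not_deployed:.1%}")
```

Because the records with a missing status are excluded, these shares describe only the cases where deployment was recorded.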

There were many ‘NA’ values in the age and sex columns, so a subset was created omitting the ‘NA’ values. There were no further challenges in the dataset; everything else was well sorted out.

In the future, the number of accidents could be reduced by analysing this dataset more deeply, for example by predicting the times and places at which accidents happen most often. Many cars were without airbags; these statistics could support safety regulations for the betterment of drivers and passengers.